
A DNA Sequence Compression Algorithm Based on LUT and LZ77

Sheng Bao, Student Member, IEEE
Dept. of Information Engineering, Nanjing Univ. of P. & T., Nanjing 210046, CHINA
Email: forrestbao@yahoo.com.cn

Shi Chen
School of Life Science, Nanjing University, Nanjing 210093, CHINA
Email: gattacalab@gmail.com

Zhiqiang Jing
Gattaca Lab, School of Life Science, Nanjing University, Nanjing 210093, CHINA
Email: jzq8255@sina.com

Ran Ren
Dept. of Telecommunication Engineering, Nanjing Univ. of P. & T., Nanjing 210046, CHINA
Email: george.r.ren@gmail.com



Abstract 

This article introduces a new DNA sequence compression algorithm based on a look-up table (LUT) and the LZ77
algorithm. By combining a LUT-based pre-coding routine with an LZ77 compression routine, the algorithm can reach a
compression ratio¹ of 1.9 bits/base or even lower. Its main advantages are fast execution, small memory occupation,
and easy implementation.

Index Terms 

Biology and genetics, Data compaction and compression 

I. Introduction 

As Cohen [1] wrote in Communications of the ACM, "Biologists are aware of the degree of difficulty in days,
months, or years in validating a given conjecture by lab experiment. Computer scientists are sure to benefit from being
active and assertive partners with biologists." Accordingly, DNA sequence compression has caught the attention of
some computer scientists and biologists [2] [3].

Many DNA sequence compression algorithms have been devised. Some of them exploit the repeats that occur in DNA
sequences [2]. We devised a new DNA sequence compression algorithm that combines LZ77 with a pre-coding routine,
which maps each three-base combination of A, T, G, and C to one of 64 ASCII characters.

Since compression is in essence a mapping between a source file and a destination file, a compression algorithm
is dedicated to finding that mapping. In electronic engineering, some logic circuits are implemented on FPGAs. The key
idea of an FPGA is the LUT (look-up table): every input signal looks up its corresponding output, which has been stored
in the chip beforehand. We carry this idea over to DNA sequence compression and build a finite LUT that implements
the mapping of our coding process.

Experiments indicate that the compression ratio is 1.9 bits/base or even better [4] [5] [6]. Our algorithm simply
adds a pre-coding process before LZ77 compression. Compared with other DNA compression algorithms, its main
advantages are fast execution, small memory occupation, and easy implementation.

¹ The compression ratio is defined as the compressed file size divided by the number of bases.

TABLE I
Look-up table that we used

II. Pre-coding Routine

A. The Look-up Table 

The look-up table describes the mapping between a DNA segment and its corresponding character. Every three
characters in the source DNA sequence (without N²) are mapped to one character chosen from a character set of 64
ASCII characters [7]. The look-up table is not unique; other character sets can be chosen for coding. The one that
we used is given in Table I. The braces after each character contain the ASCII code of that character. For easy
implementation, the characters a, t, g, c and A, T, G, C do not appear in the table, so they never occur as codes in
the pre-coded file.

For instance, if the segment "ACTGTCGATGCC" has been read, it is represented in the destination file as "j2X6".
Note that the destination file is case-sensitive.
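The triplet mapping can be sketched as a single table look-up. The following is a minimal illustration, not the authors' implementation: the 64 codes are copied from the Appendix I listing, while the flattened string `kCodes`, the helper names `baseDigit`, `encodeTriplet` and `encodeTriplets`, and the base-4 indexing scheme are assumptions introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Table I flattened into one string (codes taken from the Appendix I listing).
// Index = 16*d0 + 4*d1 + d2, where each digit is A=0, T=1, C=2, G=3.
static const std::string kCodes =
    "!b\"def#hijklmnop"
    "qrs$uvwxyz%B&DEF"
    "'HIJKLMNOPQRS(UV"
    "WXYZ0123456789+-";

// Hypothetical helper: maps an upper-case base to its base-4 digit, or -1.
int baseDigit(char base) {
    switch (base) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Encodes exactly three non-N bases into one table character.
char encodeTriplet(char b0, char b1, char b2) {
    int d0 = baseDigit(b0), d1 = baseDigit(b1), d2 = baseDigit(b2);
    assert(d0 >= 0 && d1 >= 0 && d2 >= 0);
    return kCodes[16 * d0 + 4 * d1 + d2];
}

// Encodes a string that contains no N and whose length is a multiple of three.
std::string encodeTriplets(const std::string& bases) {
    assert(bases.size() % 3 == 0);
    std::string out;
    for (std::size_t i = 0; i < bases.size(); i += 3)
        out += encodeTriplet(bases[i], bases[i + 1], bases[i + 2]);
    return out;
}
```

Calling `encodeTriplets("ACTGTCGATGCC")` reproduces the string "j2X6" from the example above. Indexing a flat string replaces the nested comparisons of the appendix listing with one arithmetic step per triplet.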

B. Handling the N 

As mentioned above, the character N stands for a base that is unknown. In our experience, N does not appear
singly; usually scores or hundreds of Ns appear together. This situation occurs frequently in the sequences under
investigation and must be handled.

When we encounter a run of successive Ns, the algorithm writes a pair of "/" characters into the destination file to
mark the start and the end of the run, with the number of Ns written between them. For instance, if the segment
"NNNNNN" has been read, it is represented in the destination file as "/6/".
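The handling of a run of Ns can be sketched as follows. This is an illustrative fragment only; the function name `encodeNRun` and its interface (a string plus a position that is advanced past the run) are assumptions made here and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: consumes the run of N/n characters starting at `pos`,
// advances `pos` past the run, and returns the marker "/<count>/".
std::string encodeNRun(const std::string& seq, std::size_t& pos) {
    assert(pos < seq.size() && (seq[pos] == 'N' || seq[pos] == 'n'));
    std::size_t count = 0;
    while (pos < seq.size() && (seq[pos] == 'N' || seq[pos] == 'n')) {
        ++count;
        ++pos;
    }
    return "/" + std::to_string(count) + "/";
}
```

For the input "NNNNNN" the function returns "/6/", matching the example in the text, and leaves `pos` at the first character after the run.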

C. Segments consisting of fewer than three non-N bases

A non-N base is a base which is not N, that is, A, T, G, or C. In Section II-A we read bases three by three, but
in some situations three successive non-N bases cannot be read. For example, suppose the segment "TCN" has been read.
As mentioned, N is handled differently from non-N bases. How, then, is the segment "TC" processed? No entry in
Table I corresponds to "TC". In this circumstance, we simply write the original segment into the destination
file.

² N refers to a base in the DNA sequence that is not available or unknown.

III. Algorithm Steps

This algorithm consists of two phases. The first phase runs from Step 1 to Step 6 and the second phase is the last step.

Step 1: Read the first three unprocessed characters. If successful, go on to Step 2. Otherwise (the EOF³ is
reached), process the last one or two characters by Step 5.

Step 2: Judge whether all three are non-N characters. If so, jump to Step 3. Otherwise, process the Ns by Step 4 and
the non-N bases by Step 5, respectively.

Step 3: Code the three characters according to their arrangement in Table I and write the coded character into the
destination file.

Step 4: Read along the successive Ns and write "/n/" into the destination file, where n is the number of successive
Ns being processed. After that, jump to Step 6.

Step 5: Write the non-N characters, of which there are fewer than three, into the destination file directly without
any modification. After that, jump to Step 6.

Step 6: Return to Step 1 and repeat the whole process until EOF is reached.

Step 7: Compress the output file with the LZ77 algorithm [8] [9].

Here is an example that illustrates how the algorithm works. The separator " - " marks the boundaries between the
steps in which the algorithm processes the file. The segment in the source file is

ATG - CG - NNNNNNNNNNNNNNNNNNNN - ACC - GCC - ATC - TCT - CG - EOF

The destination file of the pre-coding routine is

hCG/20/k6#zCG EOF

The decompression routine is simply the inverse operation of the compression routine. In [10], we provide the most
up-to-date C++ implementation of the pre-coding algorithm, both the coding and the decoding programs.
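To make the inverse relationship concrete, the pre-coding steps and their inverse can be sketched on in-memory strings. This is a simplified illustration rather than the implementation referred to in [10]: the function names `preCode` and `deCode`, the flattened table `kCodes`, and the decision to skip any character other than A, T, C, G, and N are assumptions made here. The codes themselves are taken from the appendix listings.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Codes copied from the Appendix I listing, in A, T, C, G order per position.
static const std::string kCodes =
    "!b\"def#hijklmnop"
    "qrs$uvwxyz%B&DEF"
    "'HIJKLMNOPQRS(UV"
    "WXYZ0123456789+-";
static const std::string kBases = "ATCG";

// Pre-codes a sequence held in memory (Steps 1 to 6). Characters other than
// A, T, C, G, N (for example line breaks) are skipped: an assumption.
std::string preCode(const std::string& dna) {
    std::string out, pending;  // pending: at most two bases awaiting a third
    std::size_t i = 0;
    while (i < dna.size()) {
        char ch = static_cast<char>(
            std::toupper(static_cast<unsigned char>(dna[i])));
        if (ch == 'N') {
            out += pending;  // Step 5: a short segment is written unchanged
            pending.clear();
            std::size_t run = 0;  // Step 4: count the whole run of Ns
            while (i < dna.size() &&
                   std::toupper(static_cast<unsigned char>(dna[i])) == 'N') {
                ++run;
                ++i;
            }
            out += "/" + std::to_string(run) + "/";
            continue;
        }
        ++i;
        if (kBases.find(ch) == std::string::npos) continue;
        pending += ch;
        if (pending.size() == 3) {  // Step 3: look the triplet up
            out += kCodes[16 * kBases.find(pending[0]) +
                          4 * kBases.find(pending[1]) +
                          kBases.find(pending[2])];
            pending.clear();
        }
    }
    return out + pending;  // Step 1 at EOF: the last one or two bases
}

// Inverse routine: expands codes, "/n/" markers, and raw bases.
std::string deCode(const std::string& coded) {
    std::string out;
    for (std::size_t i = 0; i < coded.size(); ++i) {
        char ch = coded[i];
        if (ch == '/') {
            std::size_t n = 0;  // decimal count between the two slashes
            for (++i; i < coded.size() && coded[i] != '/'; ++i)
                n = 10 * n + static_cast<std::size_t>(coded[i] - '0');
            out.append(n, 'N');
        } else if (kBases.find(ch) != std::string::npos) {
            out += ch;  // raw base from a short segment
        } else {
            std::size_t pos = kCodes.find(ch);
            if (pos == std::string::npos) continue;  // unknown byte: skipped
            out += kBases[pos / 16];
            out += kBases[(pos / 4) % 4];
            out += kBases[pos % 4];
        }
    }
    return out;
}
```

Applied to the example sequence of this section, `preCode` yields "hCG/20/k6#zCG" and `deCode` restores the original bases. Because the raw bases A, T, C, and G are never used as codes and digits between two "/" are read as a count, the decoder can tell the three kinds of output apart without look-ahead.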

IV. Algorithm Evaluation 

Since LZ77 has been shown to be accurate and efficient by earlier papers and by practical use, we only consider the
pre-coding routine.

A. Accuracy 

For DNA sequence storage, accuracy must come first, because even a single base mutation, insertion, deletion, or
SNP can result in a huge change of phenotype, as seen in sicklemia. No mistake can be tolerated in either compression
or decompression. Although this has not yet been proved mathematically, it can be inferred from Table I that our
algorithm is accurate, since every base arrangement corresponds uniquely to an ASCII character and every run of Ns
is mapped to a number enclosed by two "/".

B. Efficiency 

The pre-coding algorithm compresses every segment of three non-N bases from three characters into one character.
The destination file also uses fewer ASCII characters than the source file to represent a run of successive Ns if
the run is longer than 3. In practice, runs of Ns are often much longer than 3, and the more Ns the source file
contains, the more efficient the algorithm is. We can therefore infer that the file size becomes smaller.
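The character counts behind this argument can be written out as a rough check. The sketch assumes that a run of k Ns is written as two "/" characters plus the decimal digits of k, as described in Section II-B; the symbols k and c(k) are introduced here for illustration only.

```latex
\[
  \text{three non-N bases: } 3 \text{ characters} \;\longrightarrow\; 1 \text{ character},
  \qquad
  c(k) = 2 + \bigl(\lfloor \log_{10} k \rfloor + 1\bigr).
\]
\[
  c(3) = 3, \qquad c(6) = 3, \qquad c(20) = 4, \qquad c(100) = 5 .
\]
```

Under this assumption a run of exactly three Ns breaks even, and every longer run takes fewer characters in the destination file than in the source file.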

C. Time elapsed 

Many of today's compression algorithms compress very well, but they require considerable time to execute, and some
of them take much more time than LZ77. In Chen's paper [2], compressing two sequences with some newly developed
algorithms took more time than with traditional LZ77. Since our algorithm is based on a LUT rather than on sequence
statistics, it saves the time needed to gather statistical information about the sequence.

Moreover, after the pre-coding routine the number of characters is one third of that in the source file, so LZ77
takes less time to process the pre-coded file than to process the source file.

³ EOF means end-of-file, the marker for the end of a file.






Our experiments in Section V also support the above conclusions. The elapsed time of our algorithm is at the
10^-3 second level, whereas the elapsed time of many newly developed algorithms mentioned in Chen's paper [2] is
at the second level.⁴ In particular, the pre-coding routine takes only a few 10^5 CPU cycles, which means about
10^-4 second on a 1 GHz CPU.
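The conversion from cycles to time is a single division. The first line below uses the round figures quoted in the text; the second uses the 750 MHz clock of the test machine described in Section V and is only an order-of-magnitude estimate.

```latex
\[
  t = \frac{\text{cycles}}{f} \approx \frac{10^{5}}{10^{9}\ \text{s}^{-1}} = 10^{-4}\ \text{s},
\]
\[
  t_{750\,\text{MHz}} \approx \frac{10^{5}}{7.5 \times 10^{8}\ \text{s}^{-1}} \approx 1.3 \times 10^{-4}\ \text{s}.
\]
```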

Our algorithm is faster not only than those newly developed algorithms but also than LZ77, the algorithm used in
many bioinformatics databases. Table III indicates that our algorithm takes only about 26% of the time that Gzip
needs.

D. Space Occupation 

Our algorithm reads characters from the source file and writes them immediately into the destination file. It needs
very little memory, since only a few characters are stored at any time. The space occupation is therefore constant.

In our experiments, the OS had no swap partition. Everything was done in main memory, which is only 256 MB on
our PC.

V. Compression Experiments 

Experiments were done to test our algorithm. The code for testing the algorithm is continually being revised [11] [10],
so experiment results may differ somewhat between versions of our code.

These tests were performed on a computer with an AMD Duron 750 MHz CPU running MagicLinux 1.2 (Linux
kernel 2.6.9) without a swap partition. The testing programs were compiled by gcc 3.3.2 without optimization and
executed in multiuser text mode. The file system on which the tests were performed is ext3 on a 4.3 GB Quantum
Fireball hard disk at 5400 RPM (revolutions per minute). Table II lists the results in file size, while Table III lists
the execution time. The appendices list all source code of our algorithm and the Linux shell script with which we
used Gzip to apply the LZ77 algorithm.
TABLE II
Experiment results on file size



TABLE III
Experiment results on executing time

In Table III, "second phase" refers to the sum of the time elapsed in both the compression and decompression
processes, with an error of 10^-3 second. "Pre-coding CLKs" means the CPU clock cycles needed by compression in the
LUT pre-coding process, while "decoding by LUT CLKs" stands for the CPU clock cycles needed to decompress the file
coded by the LUT. CPU clock cycles are counted in units of 10000 cycles. Considering that today's computers run at
10^9 cycles per second, the error is no more than 10^-4 second. "Gzip" means the time needed to both compress and
decompress the original sequence file.

⁴ Our CPU is only 50 MHz faster than the CPU that they used.

In these experiments we also made some new discoveries.

Many tests indicate that the compression ratio obtained with LZ77 alone is almost the same as that obtained with the
pre-coding algorithm. The last step, applying LZ77 to the pre-coded file, improves the compression ratio only a
little: it compresses the file to about 75% of the size obtained with the pre-coding routine alone.
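A back-of-the-envelope estimate shows how these figures relate to the reported ratio. It assumes that every pre-coded character is stored as one 8-bit byte and ignores Ns and leftover bases; the symbols r_pre and r_pre+LZ77 are introduced here for illustration.

```latex
\[
  r_{\text{pre}} = \frac{8\ \text{bits}}{3\ \text{bases}} \approx 2.67\ \text{bits/base},
  \qquad
  r_{\text{pre+LZ77}} \approx 0.75 \times \frac{8}{3} = 2.0\ \text{bits/base}.
\]
```

Under these assumptions the estimate lands close to the roughly 1.9 bits/base quoted in the abstract.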

Considering the time cost of LZ77, the benefit is not large. If elapsed time is very important, a wise choice is to
skip the last step.

VI. Advantages of Our Algorithm

Compared with other algorithms, the compression ratio of our algorithm is generally 0.2 bits/base higher than theirs.
But the cost of this small 0.2 bits/base is very high. In Chen's paper [2], the running time of the programs is
noticeable, since it is at the second level, and some programs even take minutes or hours to run. Our algorithm runs
almost 10^3 times faster. Compared with the LZ77 algorithm widely used in bioinformatics at present, our algorithm
performs better in both compression ratio and elapsed time. Our algorithm is very useful for database storage:
sequences can be kept as records in a database instead of being maintained as files. By using only the pre-coding
routine, users can obtain the original sequences in an imperceptible amount of time.

Additionally, our algorithm is easy to implement, while some of the other algorithms take more time to program.

VII. Conclusion 

In this article, we discussed a new DNA compression algorithm whose key idea is the LUT. It performs better than
LZ77 in both compression ratio and elapsed time; its compression ratio is a little higher than that of newly
developed algorithms, but it is many times faster than them. We are doing further work, such as combining our LUT
pre-coding routine with other compression algorithms, to revise our algorithm and improve its performance.
Appendix I

C++ IMPLEMENTATION OF PRE-CODING PROCESS

#include <iostream>
#include <fstream>
#include <string>
#include <iomanip>
#define LENGTH 3

using namespace std;

char LUT(int a[]); // function used to pre-code a regular three-base segment

int main(void)
{
    // define input file
    ifstream infile;
    string infilename;
    cout << "Enter the file name: ";
    cin >> infilename;
    infile.open(infilename.c_str());
    if (!infile)
    {
        cout << "Can not open file" << endl;
        return -1;
    }
    // end of defining input file

    // define output file
    ofstream outfile;
    string outfilename = infilename + ".LUT";
    outfile.open(outfilename.c_str());
    // end of defining output file

    int temp[LENGTH]; // temporary array for storing a segment
    char ch;
    int i = 0;     // index of the next free slot in the temporary array
    int count = 0; // number of bases (Ns included) in the original file
    char enter;    // code being written into the destination file
    int numN = 0;  // number of Ns which are read

    while (infile.get(ch))
    {
        if (ch == char(10)) // omit the "ENTER", whose ASCII code is 10
            continue;

        if ((int)ch > 96)
            ch = (char)((int)ch - 32); // lower case to upper case: a->A, t->T, c->C, g->G, n->N

        if (ch != 'N')
        {
            temp[i] = (int)ch;
            count++;
            if (i == 2) // three bases have been read
            {
                enter = LUT(temp);
                outfile.put(enter);
                i = 0;
            }
            else // haven't read three chars
            {
                i++;
            }
        }
        else // situation of N
        {
            // write the one or two pending non-N bases unchanged (Section II-C)
            for (int k = 0; k < i; k++)
                outfile.put((char)temp[k]);
            i = 0;

            numN = 1;
            // peek so that the first non-N character stays in the stream
            while (infile.peek() == 'N' || infile.peek() == 'n')
            {
                infile.get(ch);
                numN++;
            }
            outfile << "/" << numN << "/";
            count += numN;
        }
    } // end of while

    // output the last one or two chars in their original order
    for (int k = 0; k < i; k++)
        outfile.put((char)temp[k]);

    infile.close();
    outfile.close();
    cout << "There are totally " << setw(4) << count << " base(s) in the original file" << endl;
    return 0;
} // end of main function

// ASCII codes of the bases: A = 65, T = 84, C = 67, G = 71, N = 78
char LUT(int temp[3])
{
    char enter = 0; // stays 0 if the segment is not made of A, T, C and G only
    if (temp[0] != 78 && temp[1] != 78 && temp[2] != 78)
    {
        if (temp[0] == 65) // A
        {
            if (temp[1] == 65) { // A
                if (temp[2] == 65) { enter = (char)33; } // A
                else if (temp[2] == 84) { enter = (char)98; } // T
                else if (temp[2] == 67) { enter = (char)34; } // C
                else if (temp[2] == 71) { enter = (char)100; } // G
            }
            if (temp[1] == 84) { // T
                if (temp[2] == 65) { enter = (char)101; } // A
                else if (temp[2] == 84) { enter = (char)102; } // T
                else if (temp[2] == 67) { enter = (char)35; } // C
                else if (temp[2] == 71) { enter = (char)104; } // G
            }
            if (temp[1] == 67) { // C
                if (temp[2] == 65) { enter = (char)105; } // A
                else if (temp[2] == 84) { enter = (char)106; } // T
                else if (temp[2] == 67) { enter = (char)107; } // C
                else if (temp[2] == 71) { enter = (char)108; } // G
            }
            if (temp[1] == 71) { // G
                if (temp[2] == 65) { enter = (char)109; } // A
                else if (temp[2] == 84) { enter = (char)110; } // T
                else if (temp[2] == 67) { enter = (char)111; } // C
                else if (temp[2] == 71) { enter = (char)112; } // G
            }
        }
        if (temp[0] == 84) // T
        {
            if (temp[1] == 65) { // A
                if (temp[2] == 65) { enter = (char)113; } // A
                else if (temp[2] == 84) { enter = (char)114; } // T
                else if (temp[2] == 67) { enter = (char)115; } // C
                else if (temp[2] == 71) { enter = (char)36; } // G
            }
            if (temp[1] == 84) { // T
                if (temp[2] == 65) { enter = (char)117; } // A
                else if (temp[2] == 84) { enter = (char)118; } // T
                else if (temp[2] == 67) { enter = (char)119; } // C
                else if (temp[2] == 71) { enter = (char)120; } // G
            }
            if (temp[1] == 67) { // C
                if (temp[2] == 65) { enter = (char)121; } // A
                else if (temp[2] == 84) { enter = (char)122; } // T
                else if (temp[2] == 67) { enter = (char)37; } // C
                else if (temp[2] == 71) { enter = (char)66; } // G
            }
            if (temp[1] == 71) { // G
                if (temp[2] == 65) { enter = (char)38; } // A
                else if (temp[2] == 84) { enter = (char)68; } // T
                else if (temp[2] == 67) { enter = (char)69; } // C
                else if (temp[2] == 71) { enter = (char)70; } // G
            }
        }
        if (temp[0] == 67) // C
        {
            if (temp[1] == 65) { // A
                if (temp[2] == 65) { enter = (char)39; } // A
                else if (temp[2] == 84) { enter = (char)72; } // T
                else if (temp[2] == 67) { enter = (char)73; } // C
                else if (temp[2] == 71) { enter = (char)74; } // G
            }
            if (temp[1] == 84) { // T
                if (temp[2] == 65) { enter = (char)75; } // A
                else if (temp[2] == 84) { enter = (char)76; } // T
                else if (temp[2] == 67) { enter = (char)77; } // C
                else if (temp[2] == 71) { enter = (char)78; } // G
            }
            if (temp[1] == 67) { // C
                if (temp[2] == 65) { enter = (char)79; } // A
                else if (temp[2] == 84) { enter = (char)80; } // T
                else if (temp[2] == 67) { enter = (char)81; } // C
                else if (temp[2] == 71) { enter = (char)82; } // G
            }
            if (temp[1] == 71) { // G
                if (temp[2] == 65) { enter = (char)83; } // A
                else if (temp[2] == 84) { enter = (char)40; } // T
                else if (temp[2] == 67) { enter = (char)85; } // C
                else if (temp[2] == 71) { enter = (char)86; } // G
            }
        }
        if (temp[0] == 71) // G
        {
            if (temp[1] == 65) { // A
                if (temp[2] == 65) { enter = (char)87; } // A
                else if (temp[2] == 84) { enter = (char)88; } // T
                else if (temp[2] == 67) { enter = (char)89; } // C
                else if (temp[2] == 71) { enter = (char)90; } // G
            }
            if (temp[1] == 84) { // T
                if (temp[2] == 65) { enter = (char)48; } // A
                else if (temp[2] == 84) { enter = (char)49; } // T
                else if (temp[2] == 67) { enter = (char)50; } // C
                else if (temp[2] == 71) { enter = (char)51; } // G
            }
            if (temp[1] == 67) { // C
                if (temp[2] == 65) { enter = (char)52; } // A
                else if (temp[2] == 84) { enter = (char)53; } // T
                else if (temp[2] == 67) { enter = (char)54; } // C
                else if (temp[2] == 71) { enter = (char)55; } // G
            }
            if (temp[1] == 71) { // G
                if (temp[2] == 65) { enter = (char)56; } // A
                else if (temp[2] == 84) { enter = (char)57; } // T
                else if (temp[2] == 67) { enter = (char)43; } // C
                else if (temp[2] == 71) { enter = (char)45; } // G
            }
        }
    }
    return enter;
}



Appendix II 

C++ IMPLEMENTATION OF DECODING FILE CODED BY PRE-CODING PROCESS 
#include <iostream>
#include <fstream>
#include <string>
#define LENGTH 3

using namespace std;

int main(void)
{
    // define input file
    ifstream infile;
    string infilename;
    cout << "Enter the file name: ";
    cin >> infilename;
    string outfilename = infilename + ".de";
    // infilename = infilename + ".LUT";
    infile.open(infilename.c_str());
    if (!infile)
    {
        cout << "Can not open file" << endl;
        return -1;
    }
    // end of defining input file

    // define output file
    ofstream outfile;
    outfile.open(outfilename.c_str());
    // end of defining output file

    int i, j = 0;
    char ch;

    while (infile.get(ch))
    {
        if (ch != '/') // in case that the char is not '/'
        {
            switch ((int)ch) // according to the look-up table
            {
            case 33:  outfile << "AAA"; break;
            case 98:  outfile << "AAT"; break;
            case 34:  outfile << "AAC"; break;
            case 100: outfile << "AAG"; break;
            case 101: outfile << "ATA"; break;
            case 102: outfile << "ATT"; break;
            case 35:  outfile << "ATC"; break;
            case 104: outfile << "ATG"; break;
            case 105: outfile << "ACA"; break;
            case 106: outfile << "ACT"; break;
            case 107: outfile << "ACC"; break;
            case 108: outfile << "ACG"; break;
            case 109: outfile << "AGA"; break;
            case 110: outfile << "AGT"; break;
            case 111: outfile << "AGC"; break;
            case 112: outfile << "AGG"; break;
            case 113: outfile << "TAA"; break;
            case 114: outfile << "TAT"; break;
            case 115: outfile << "TAC"; break;
            case 36:  outfile << "TAG"; break;
            case 117: outfile << "TTA"; break;
            case 118: outfile << "TTT"; break;
            case 119: outfile << "TTC"; break;
            case 120: outfile << "TTG"; break;
            case 121: outfile << "TCA"; break;
            case 122: outfile << "TCT"; break;
            case 37:  outfile << "TCC"; break;
            case 66:  outfile << "TCG"; break;
            case 38:  outfile << "TGA"; break;
            case 68:  outfile << "TGT"; break;
            case 69:  outfile << "TGC"; break;
            case 70:  outfile << "TGG"; break;
            case 39:  outfile << "CAA"; break;
            case 72:  outfile << "CAT"; break;
            case 73:  outfile << "CAC"; break;
            case 74:  outfile << "CAG"; break;
            case 75:  outfile << "CTA"; break;
            case 76:  outfile << "CTT"; break;
            case 77:  outfile << "CTC"; break;
            case 78:  outfile << "CTG"; break;
            case 79:  outfile << "CCA"; break;
            case 80:  outfile << "CCT"; break;
            case 81:  outfile << "CCC"; break;
            case 82:  outfile << "CCG"; break;
            case 83:  outfile << "CGA"; break;
            case 40:  outfile << "CGT"; break;
            case 85:  outfile << "CGC"; break;
            case 86:  outfile << "CGG"; break;
            case 87:  outfile << "GAA"; break;
            case 88:  outfile << "GAT"; break;
            case 89:  outfile << "GAC"; break;
            case 90:  outfile << "GAG"; break;
            case 48:  outfile << "GTA"; break;
            case 49:  outfile << "GTT"; break;
            case 50:  outfile << "GTC"; break;
            case 51:  outfile << "GTG"; break;
            case 52:  outfile << "GCA"; break;
            case 53:  outfile << "GCT"; break;
            case 54:  outfile << "GCC"; break;
            case 55:  outfile << "GCG"; break;
            case 56:  outfile << "GGA"; break;
            case 57:  outfile << "GGT"; break;
            case 43:  outfile << "GGC"; break;
            case 45:  outfile << "GGG"; break;
            // bases written unchanged because fewer than three were read
            case 65:  outfile << "A"; break;
            case 84:  outfile << "T"; break;
            case 67:  outfile << "C"; break;
            case 71:  outfile << "G"; break;
            }
        }
        else // case of "/n/": there are Ns nearby
        {
            // accumulate the number of Ns; stop at the closing '/' or at EOF
            for (i = 0, j = 0; infile.get(ch) && ch != '/'; i++)
            {
                j = 10 * j;
                j += (int)(ch) - 48;
            }
            for (i = 1; i <= j; i++)
                outfile << 'N';
        }
    }

    infile.close();
    outfile.close();
    return 0;
}
Appendix III

Linux Shell that we used to test LZ77 performance

#!/bin/sh

echo "please input a file name"
read file_name
date
num=1
while [ $num -lt 1000 ]
do
    gzip "$file_name"
    gzip -d "$file_name.gz"
    num=$((num+1))
done
date



